Audience: Diverse Background
Time: 3 day workshop (18 hours)
Pre-Requisites: Intro to Python (Data Science Campus).
Brief Description:
This course introduces the basics of carrying out a statistical analysis in Python. It covers exploratory data analysis, constructing and interpreting linear and generalized linear models, and introduces Bayesian modelling.
Aims, Objectives and Intended Learning Outcomes:
Chapter 1: Exploratory Data Analysis
By the end of Chapter 1, learners should know:
Chapter 2: Model Basics
By the end of Chapter 2, learners should know:
Chapter 3: Generalized Linear Models
By the end of Chapter 3, learners should know:
what probability is;
what a random variable is;
what a probability distribution is and how it differs for continuous vs. discrete random variables;
several common probability distributions used to model variation in the response variable;
how to implement a generalized linear model in Python.
Chapter 4: Bayesian Analysis
By the end of Chapter 4, learners should know:
Acknowledgements: Many thanks to Dr. Paraskevi Pericleous for key initial work on this module. Many thanks to Daniel J. Lewis for preparing the GLM and Bayesian Practical.
Libraries Needed
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # data plotting
import seaborn as sns # data visualisation and plotting
import statsmodels.api as sm # statistical modelling package
import statsmodels.formula.api as smf # statistical modelling package with R-like formulas
import scipy.stats as stats
import math
#sns.set(color_codes=True)
from statsmodels.genmod.generalized_linear_model import GLM # importing packages to run GLM
from statsmodels.genmod import families # importing families for exponential families
from scipy.stats import binom # to illustrate the binomial distribution.
from sklearn import datasets, linear_model # fetching iris dataset and linear model functions
from sklearn.metrics import mean_squared_error, r2_score
# Seaborn plot default configurations
sns.set_style("white")
# set the custom size for my graphs
sns.set(rc={'figure.figsize':(8.7,6.27)})
The road map of the modelling process, adapted from Ecological Models and Data with R by Ben Bolker.
Serra Rio do Rastro. Source: Rosanetur https://flic.kr/p/QV7ttk.
1. Identify the data science question you wish to explore. Try to identify your question at both a general, conceptual level (“Are bed nets an effective intervention against malaria?”) and at a specific level (“What is the percentage reduction in malaria cases between area A, where bed nets are being used, and area B, where bed nets are not being used?”).
2. Explore the data. Explore the variation and covariation in the data. Identify patterns through plotting, which can be explored further using formal statistical testing.
3. Choose the model: choose the mathematical description of the pattern you are trying to describe.
4. Fit parameters: once you have defined your model, you can estimate the parameters (slope, intercept, …). The parameters are effectively the answers to your question from step 1.
5. Estimate confidence intervals/test hypotheses/select models: measurements of uncertainty are necessary to contextualise your best-fit parameters. By quantifying the uncertainty in the fit of a model, you can estimate confidence limits for the parameters. You can then test your hypotheses statistically and practically: can we tell the difference statistically between areas with and without bed nets? Are these differences large enough to make bed nets an effective intervention strategy?
Exploratory data analysis is a fluid process and there is no single approach. It can be thought of as a process of hypothesis generation, data exploration, and formal statistical testing. It comes after the stage of importing and tidying your data.
In this section we will walk through organising your data, getting to know your data structure, and understanding variation and covariation within and between variables.
For these exercises we will use the iris dataset, which consists of morphological measurements of three related species of iris flowers.
Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.
The data is provided as a built-in dataset in the sklearn package which we will change to be formatted as a pandas dataframe.
To conduct regression analyses in python it is important to have your data in a format that is easy to work with. A good approach is to organise your data as tabular data.
Tabular data is a set of values, where each value is placed in its own “cell”, each variable in its own column, and each observation in its own row. The book R for Data Science by Garrett Grolemund and Hadley Wickham is a great resource when thinking about tidying data. Here, we use some of the definitions they set out in the book to describe tidy data.
Some definitions for tidy data:
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.
In python, the pandas package makes it easy to format your data in a tabular format.
Reading in the data
# Define sklearn_to_df function to convert from sklearn to a pandas dataframes
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Categorical.from_codes(sklearn_dataset.target, sklearn_dataset.target_names)
    return df
Importing and converting iris data to a pandas dataframe
# import and convert format of iris data from sklearn
df_iris = sklearn_to_df(datasets.load_iris())
Looking at the head of the dataframe, can you spot the tidy data structure?
df_iris.head(5)
Exercise
Recall the three definitions above: value, observation, and variable. Which of the coloured boxes in the image above corresponds to each definition?
To get started, let’s explore the following questions for our dataset.
What is the structure of the data?
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
One of the things we will wish to know about our variables is whether they are continuous or categorical.
Continuous variable: a variable that can take on an unlimited number of values between the lowest and highest points of measurement.
Categorical variable: a variable that can take one of a limited set of values. For example, if you have a dataset about a household then you will typically find variables like gender, marital status, and county.
Exercises
Run the following lines of code to answer the questions below
print(df_iris.shape)
df_iris.head(5)
df_iris.tail(5)
df_iris.info()
df_iris.describe()
# Let's simplify the column names and make them more meaningful
df_iris = df_iris.rename(columns = {'sepal length (cm)': 'sepal_length', 'sepal width (cm)': 'sepal_width', 'petal length (cm)': 'petal_length', 'petal width (cm)': 'petal_width','target': 'species'})
# What are the names of the columns?
df_iris.columns
Now we know more about the structure of our data we can explore the variation and covariation in the variables. Knowing the variation and covariation between variables can help us to understand the spread of the data and potential relationships in the data that may give insight into modelling.
Terms:
variation: the tendency of values of a variable to change from measurement to measurement. Variation can come in several forms:
measurement error: you may measure the same thing twice and get slightly different values.
natural variation: the term I use to refer to variation that is inherent in a population or sample (e.g. as humans we all have different heights; the way these values vary reflects the variation in the sample or population).
covariation: the tendency of values of a variable to change with the values of another variable.
Visualisation is a great initial tool to explore these relationships further.
How you visualise your variables depends on whether the variable is categorical or continuous.
A categorical variable
Categorical or discrete variable: a variable that can take on one of a limited, usually fixed number of possible values, assigning each value to a particular group or nominal category.
To examine the distribution of a categorical variable, we can use a bar plot:
Bar plot of species
species_counts = sns.countplot(x= "species", data = df_iris)
species_counts
In this case, the bar chart shows that there are the same number of measurements for each species in the data set.
A continuous variable
A continuous variable can take any of an infinite set of ordered values (e.g. numbers and date-times). We can inspect the spread of the data using a density plot or box plot.
petal_length_all_distplot = sns.distplot(df_iris['petal_length'],
    hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
petal_length_all_distplot.set(xlabel='Petal_length', ylabel='Density')
## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Petal_length')]
petal_length_all_distplot
If we look at the distribution of petal length, something interesting seems to be happening. It appears that the distribution is bimodal meaning that there are two modes, in this case two maxima, in the data.
Let’s see if we can explore it further.
df_setosa = df_iris[df_iris.species == 'setosa']
petal_length_species = sns.distplot(df_setosa[['petal_length']], label = 'setosa',
hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
df_virginica = df_iris[df_iris.species == 'virginica']
petal_length_species = sns.distplot(df_virginica[['petal_length']], label = 'virginica',
hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
df_versicolor = df_iris[df_iris.species == 'versicolor']
petal_length_species = sns.distplot(df_versicolor[['petal_length']], label = 'versicolor',
hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3})
petal_length_species.set(xlabel='Petal Length', ylabel='Density x 10')
## [Text(0, 0.5, 'Density x 10'), Text(0.5, 0, 'Petal Length')]
petal_length_species
A continuous and categorical variable
petal_width_boxplot = sns.boxplot(data = df_iris, y = 'petal_width', x = 'species')
petal_width_boxplot
sepal_length_violin = sns.violinplot(data = df_iris, y = "sepal_length", x = 'species')
sepal_length_violin
Violin plots are similar to box plots, but they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
Two continuous variables
Plotting two continuous variables, we can see how they change in relation to each other. In these plots we are looking to see whether there is a:
positive relationship: as one variable increases, the other variable increases
negative relationship: as one variable increases, the other decreases
no relationship: no discernible pattern of change in one variable with the other
non-linear relationship: we may also be able to pick out other patterns, e.g. polynomials
iris_scatter_petal_length_sepal_width = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'sepal_width', hue = 'species')
iris_scatter_petal_length_sepal_width
Exercise
Make a box plot or a violin plot of sepal width by species. How does this box plot/violin plot compare to the earlier box plot/violin plot we made of petal width and sepal length?
Make a scatterplot to visualise the relationship between petal length and sepal length coloured by species. What patterns can you pick out from the data?
Pairplots can be a quick and useful way to summarise your dataset and to inspect all of the pairwise relationships simultaneously. Try running the following code to make a pairplot. What does this code do?
iris_all_pairplot = sns.pairplot(data = df_iris, hue="species", diag_kind="kde")
iris_all_pairplot
Ideally your model will capture signals (i.e. patterns) generated by the phenomenon of interest and ignore noise (i.e. random variation) you’re not interested in.
In model basics you will learn how models work mechanistically, focussing on the important family of linear models.
Hypothesis generation vs. hypothesis confirmation
Traditionally, one of the focuses of modelling is on inference, or confirming that a hypothesis is true. To do it correctly you need to know two things:
Note that in order to confirm a hypothesis, you must use data independent of the data used to generate the hypothesis.
One of the goals of models is to partition data into patterns and residuals. Strong patterns will hide subtler trends. We can use models to peel back layers of structure as we explore a dataset.
There are 2 key parts to a model, the family of models and the fitted model.
Family of models
Model Family. “Clothesline” by Frau Heimlich is licensed under CC BY-NC-SA 2.0
Fitted model
Model Fitting. “John Boyer Bespoke Suit Tailors” by Shayna Grosh is licensed under CC BY-NC-ND 4.0
Note that it is important to understand that a fitted model is just the closest model from a family of models. What is “best” is determined according to some criteria, which we will discuss later.
Clothes as an analogy
If we use clothes as an analogy, the family of models is like an assortment of garments you could choose to ‘clothe’ the data in. Just as some clothes will be more suitable than others depending on what you wish to do (e.g. nice dress to a wedding), the same is true for models. The type of model will depend on the type of data you have and what you wish to do with your analysis.
Model fitting is similar to getting a garment tailored. Just as you might make alterations to improve the fit of a garment, you will adapt the chosen model to get a better fit to the data.
Terminology
In the field of data science you will see a number of terms used to refer to the same things. Here, we will use response variable to refer to the measured variable you are trying to explain.
We will use explanatory variables to refer to the measured variables that we use to try to explain the response variable. Other terms that you may come across for these concepts include:
response variable: dependent, target (machine learning)
explanatory variables: independent, features (machine learning)
Linear regression is one of the most important and widely used regression techniques. Its main advantage is its simple structure and ease of interpreting results.
Linear models take the mathematical form:
\(y = ax + b\)
where \(y\) is the response variable, \(a\) is the slope that quantifies the change in \(y\) with increases in the explanatory variable \(x\), and \(b\) is the intercept.
Selling irises to nostalgic students
Source: Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.
A data scientist is setting up a side business of breeding irises to sell to ex-statistics undergraduates who have rosy nostalgic memories of studying the iris dataset when they were budding statisticians. They are interested in the relationship between petal width and other metrics, like petal length, as they find that the irises with the widest petals sell better.
Let’s take a closer look at the iris data to explore the relationship between petal length and petal width.
iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
iris_scatter_petal_length_width
From the plot it looks like there is a positive relationship between petal length and petal width. Let’s use a model to capture the pattern and make it more explicit. In this case, the relationship looks linear, so we can use the form \(y = ax + b\). We are interested in how petal width changes with petal length. We can make a guess at what the parameters \(a\) and \(b\) might be. It looks like petal width increases more slowly than petal length, maybe at 0.3 cm for every 1 cm of petal length.
iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
x = np.arange(7)
a = 0.3
b = 0.1
y = a*x + b
iris_scatter_petal_length_width = sns.regplot(x=x, y=y, marker="+")
iris_scatter_petal_length_width
Not bad, but maybe we could do a bit better.
iris_scatter_petal_length_width = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
x = np.arange(7)
a = 0.37
a1 = 0.35
b = -0.2
b1 = -0.3
y = a*x + b
y1 = a1*x + b1
iris_scatter_petal_length_width = sns.regplot(x=x, y=y, marker="+")
iris_scatter_petal_length_width = sns.regplot(x=x, y=y1, marker="+", line_kws={"color": "red"})
iris_scatter_petal_length_width
The two lines have slightly different parameters. It is difficult to tell by eye which line fits the data better. We could do this ourselves, by measuring the residuals, the distance from the data (actual response) to the line (predicted response), for each model, and comparing them. But this can be time-consuming, especially if our goal is to find the best model which could involve looking at more than just two models!
Luckily python has built-in functions that will explore all the possibilities to find the ‘best’ line. In linear regression, one of the ways to define the ‘best’ line is by finding the line that minimizes the sum of squared residuals (also called the residual sum of squares).
Calculating the sum of squared residuals involves taking the residual distances, squaring them, and summing them. Fitting a line this way is called Ordinary Least Squares (OLS), after which the ols function in statsmodels takes its name.
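As a minimal sketch of this idea (reusing df_iris and the two hand-picked lines from above; the helper name sum_sq_resid is ours, not a statsmodels function), we can compare the candidate fits by their sums of squared residuals directly:
# Compare the two hand-picked lines by their sum of squared residuals
def sum_sq_resid(a, b, x, y):
    predicted = a * x + b                # predicted petal width for each flower
    return ((y - predicted) ** 2).sum()  # sum of squared residuals
x_obs = df_iris['petal_length']
y_obs = df_iris['petal_width']
print(sum_sq_resid(0.37, -0.2, x_obs, y_obs))  # first guess
print(sum_sq_resid(0.35, -0.3, x_obs, y_obs))  # second guess
The line with the smaller sum of squared residuals is the better fit under the OLS criterion.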
Here we use the statsmodels formula API to construct our models. This allows us to specify an intuitive model formula for our analyses, where formula = y ~ x.
In this case y is our response variable and the tilde means “depends on” x, which represents an explanatory variable (i.e. the variable we are using to try to explain the variation in y).
Let’s take a look at what this looks like in practice. In this case we’re interested in explaining the variation in petal_width using the explanatory variable petal_length.
Note that the smf.ols function automatically includes the intercept in the model, so we don’t have to specify one.
model1 = smf.ols(formula = 'petal_width ~ petal_length', data = df_iris)
results_mod1 = model1.fit()
print(results_mod1.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: petal_width R-squared: 0.927
## Model: OLS Adj. R-squared: 0.927
## Method: Least Squares F-statistic: 1882.
## Date: Tue, 21 Apr 2020 Prob (F-statistic): 4.68e-86
## Time: 10:35:48 Log-Likelihood: 24.796
## No. Observations: 150 AIC: -45.59
## Df Residuals: 148 BIC: -39.57
## Df Model: 1
## Covariance Type: nonrobust
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## Intercept -0.3631 0.040 -9.131 0.000 -0.442 -0.285
## petal_length 0.4158 0.010 43.387 0.000 0.397 0.435
## ==============================================================================
## Omnibus: 5.765 Durbin-Watson: 1.455
## Prob(Omnibus): 0.056 Jarque-Bera (JB): 5.555
## Skew: 0.359 Prob(JB): 0.0622
## Kurtosis: 3.611 Cond. No. 10.3
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The summary gives us a range of diagnostic information about the model we’ve fitted, split into three tables. Here’s a quick guide on what is included:
Top Table - model fit info.
Middle Table - an important table!
Bottom table - Diagnostics
As you can see, there is a lot of information in the summary. Let’s pick out a few relevant values from the model to focus on.
Results of the model fit
We can get these specific parameters directly from the model object.
print("Intercept = ",results_mod1.params['Intercept'])
## Intercept = -0.36307552131902865
print("(Petal length) coef. = ", results_mod1.params['petal_length'])
## (Petal length) coef. = 0.4157554163524114
print("R^2 = ", results_mod1.rsquared_adj)
## R^2 = 0.9266173379025906
Constructing the line of best fit
This is a useful way to pull out information to further analyse or include in reports. We can pull out these coefficients to plot our line of best fit.
iris_scatter = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
x = np.arange(7)
a = 0.37
a1 = 0.35
b = -0.2
b1 = -0.3
y = a*x + b
y1 = a1*x + b1
y2 = results_mod1.params['petal_length']*x + results_mod1.params['Intercept']
iris_scatter = sns.regplot(x = x, y = y, marker="+")
iris_scatter = sns.regplot(x = x, y = y1, marker="+", line_kws = {"color": "red"})
iris_scatter = sns.regplot(x = x, y = y2, marker = "+", line_kws = {"color": "blue"})
iris_scatter
We can do this plot by hand (as above) or directly from our model.
iris_best_fit = sns.regplot(x = 'petal_length', y = 'petal_width', ci = 95, data = df_iris)
iris_best_fit
Residuals
fig, ax2 = plt.subplots(figsize = (8,5))
x = df_iris.petal_length
y = df_iris.petal_width
xfit = np.arange(8)
yfit = results_mod1.params['petal_length']*xfit + results_mod1.params['Intercept']
slope = results_mod1.params['petal_length']
intercept = results_mod1.params['Intercept']
# Plot
ax2.scatter(x,y, marker = 'x', color = 'k')
ax2.plot(xfit, yfit, color = 'r')
# plot residuals
def residual_line(x, y, slope, intercept):
xs = [x, x]
ys = [y, (intercept + slope * x)]
return xs, ys
errs = [ax2.plot(*residual_line(x, y, slope, intercept), color = '0.5')]
ax2.set_xlabel("$X $", fontsize = 12)
ax2.set_ylabel("$Y $", fontsize = 12);
actual = df_iris.petal_width
predicted = results_mod1.params['petal_length']*df_iris.petal_length + results_mod1.params['Intercept']
difference = actual-predicted
x = df_iris.petal_length
fig, resid_plot = plt.subplots(figsize = (8,5))
resid_plot.scatter(x, difference, marker = 'o', color = 'k')
resid_plot.set_xlabel("Petal Length", fontsize = 12)
resid_plot.set_ylabel("Residuals", fontsize = 12);
# The residual distances are the absolute differences: we square the differences and
# then take the square root. Note that OLS minimises the sum of the squared
# differences (difference_squared.sum()), not this sum of absolute distances.
difference_squared = difference**2
residuals = difference_squared.apply(math.sqrt)
print(residuals.sum())
## 23.470577056521144
Residuals
We can then inspect the residuals in the model. Residuals represent the left over variation in the response variable not explained by the model. A pattern in the residuals may indicate that we are missing a variable or that our assumption of normality is incorrect. When we are looking at the residuals, we want them to form an ‘amorphous cloud’, i.e. a cloud shape with no patterns.
df_iris_resid = sns.residplot(y = 'petal_width', x = 'petal_length', data = df_iris)
df_iris_resid
It looks like there might be some patterns left in the data that are unexplained. We can see that as petal_length increases the variability and range of residual values also increase.
Checking for normality of the residuals
One of the assumptions of a linear model is that the residuals are normally distributed. Note that this does not mean that the response variable needs to come from a normal distribution, just that the residuals are normally distributed. Normal distributions are often referred to as bell curves because of their shape. Let’s take a look at a normal distribution with mean 0 and variance 1.
Normal
mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
What you’ll notice in the plot above is that it is symmetrical, with most values falling close to 0.
A common plot used to inspect model residuals is a qqplot. The qqplot compares the residuals to a theoretical normal distribution. If the residuals come from a normal distribution we would expect the blue dots to line up with the red line.
resid = results_mod1.resid
# Use statsmodels qqplot to graph residuals
# make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))
# call the qqplot graph from statsmodels 'graphics' module.
# fits against the normal distribution as standard.
sm.graphics.qqplot(resid, line='45', fit=True, ax=ax);
The plot doesn’t look too bad. The residuals mostly match the red line, but we can see that the tails of the qqplot are a little skewed, particularly to the right. This may mean that there is still some structure in the variation in petal_width that is not accounted for by petal_length. We can see whether including species in the model helps explain some of the remaining variation in the model.
Now we are constructing our model to explain the variation in petal_width using the variables for petal_length and species. In this case, because species is a categorical variable we can specify it using C(species).
model2 = smf.ols(formula = 'petal_width ~ petal_length + C(species)', data = df_iris)
results_mod2 = model2.fit()
print(results_mod2.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: petal_width R-squared: 0.946
## Model: OLS Adj. R-squared: 0.944
## Method: Least Squares F-statistic: 845.5
## Date: Tue, 21 Apr 2020 Prob (F-statistic): 4.88e-92
## Time: 10:35:53 Log-Likelihood: 46.704
## No. Observations: 150 AIC: -85.41
## Df Residuals: 146 BIC: -73.37
## Df Model: 3
## Covariance Type: nonrobust
## ============================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------------------
## Intercept -0.0908 0.056 -1.611 0.109 -0.202 0.021
## C(species)[T.versicolor] 0.4354 0.103 4.234 0.000 0.232 0.639
## C(species)[T.virginica] 0.8377 0.145 5.764 0.000 0.550 1.125
## petal_length 0.2304 0.034 6.691 0.000 0.162 0.298
## ==============================================================================
## Omnibus: 6.210 Durbin-Watson: 1.736
## Prob(Omnibus): 0.045 Jarque-Bera (JB): 9.578
## Skew: -0.110 Prob(JB): 0.00832
## Kurtosis: 4.218 Cond. No. 54.0
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
There are a few key things to pick out from the table above. First, we have an increase in the adjusted R^2, which is now 0.944, meaning we’ve explained 94.4% of the variation in petal_width. We also have several variables now:
Intercept
C(species)[T.versicolor]
C(species)[T.virginica]
petal_length
But remember that in our df_iris dataset there are three species of iris:
setosa
versicolor
virginica
Categorical variables always contribute to the intercept in a linear model. When fitting the model, the model function uses the first level of the categorical variable, in this case setosa, as the baseline Intercept. The coefficients estimated for C(species)[T.versicolor] and C(species)[T.virginica] are estimated in relation to that.
Let’s see how our fit looks when it is plotted.
iris_scatter_species = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
x = np.arange(7)
a = results_mod2.params['petal_length']
b_setosa = results_mod2.params['Intercept']
b_virginica = results_mod2.params['C(species)[T.virginica]'] + results_mod2.params['Intercept']
b_versicolor = results_mod2.params['C(species)[T.versicolor]'] + results_mod2.params['Intercept']
y_setosa = a*x + b_setosa
y_virginica = a*x + b_virginica
y_versicolor = a*x + b_versicolor
iris_scatter_species = sns.regplot(x = x, y = y_setosa, marker="+", line_kws = {"color": "blue"})
iris_scatter_species = sns.regplot(x = x, y = y_virginica, marker = "+", line_kws = {"color": "green"})
iris_scatter_species = sns.regplot(x = x, y = y_versicolor, marker = "+", line_kws = {"color": "orange"})
iris_scatter_species
As you can see from the plot above, all of the lines have the same slope so that they are parallel to one another. The only thing that changes is the intercept which moves the fitted line up or down.
Let’s see what the residuals look like for our new model.
resid2 = results_mod2.resid
fitted2 = results_mod2.fittedvalues
resid2_plot = sns.scatterplot(x = fitted2, y = resid2)
resid2_plot.set(xlabel='Fitted', ylabel='Residuals')
## [Text(0, 0.5, 'Residuals'), Text(0.5, 0, 'Fitted')]
resid2_plot
Let’s inspect the qqplot
# Use statsmodels qqplot to graph errors
# make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))
# call the qqplot graph from statsmodels 'graphics' module.
# fits against the normal distribution as standard.
sm.graphics.qqplot(resid2, line='45', fit=True, ax=ax);
Now our residuals are slightly skewed to the left. Is there something more we might be missing from the model?
So far we’ve allowed the intercept in our model to vary by species. Let’s try letting our slope change also with species. By allowing our slope to vary by species we are exploring the hypothesis that how petal_width changes with petal_length varies by species.
To do this in the model, we need to use an interaction term. Interactions are included using the multiplication symbol \(*\) between two variables.
model3 = smf.ols(formula = 'petal_width ~ petal_length*C(species)', data = df_iris)
results_mod3 = model3.fit()
print(results_mod3.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: petal_width R-squared: 0.948
## Model: OLS Adj. R-squared: 0.946
## Method: Least Squares F-statistic: 521.9
## Date: Tue, 21 Apr 2020 Prob (F-statistic): 2.34e-90
## Time: 10:35:55 Log-Likelihood: 49.695
## No. Observations: 150 AIC: -87.39
## Df Residuals: 144 BIC: -69.33
## Df Model: 5
## Covariance Type: nonrobust
## =========================================================================================================
## coef std err t P>|t| [0.025 0.975]
## ---------------------------------------------------------------------------------------------------------
## Intercept -0.0482 0.215 -0.225 0.823 -0.473 0.376
## C(species)[T.versicolor] -0.0361 0.315 -0.114 0.909 -0.659 0.587
## C(species)[T.virginica] 1.1843 0.334 3.544 0.001 0.524 1.845
## petal_length 0.2012 0.146 1.380 0.170 -0.087 0.490
## petal_length:C(species)[T.versicolor] 0.1298 0.156 0.835 0.405 -0.178 0.437
## petal_length:C(species)[T.virginica] -0.0409 0.153 -0.268 0.789 -0.343 0.261
## ==============================================================================
## Omnibus: 6.984 Durbin-Watson: 1.804
## Prob(Omnibus): 0.030 Jarque-Bera (JB): 11.955
## Skew: -0.085 Prob(JB): 0.00254
## Kurtosis: 4.373 Cond. No. 176.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
If we look at the results, we can see that the slope for petal_length also uses setosa (the species name that is missing) as a baseline.
We can plot the model fit using our new coefficients. This time, both the slope and intercept will vary.
iris_scatter_species = sns.scatterplot(data = df_iris, x = 'petal_length', y = 'petal_width', hue = 'species')
x = np.arange(7)
##Slope
a_setosa = results_mod3.params['petal_length']
a_virginica = results_mod3.params['petal_length:C(species)[T.virginica]'] + results_mod3.params['petal_length']
a_versicolor = results_mod3.params['petal_length:C(species)[T.versicolor]'] + results_mod3.params['petal_length']
## Intercept
b_setosa = results_mod3.params['Intercept']
b_virginica = results_mod3.params['C(species)[T.virginica]'] + results_mod3.params['Intercept']
b_versicolor = results_mod3.params['C(species)[T.versicolor]'] + results_mod3.params['Intercept']
##Predicted Petal Width
y_setosa = a_setosa*x + b_setosa
y_virginica = a_virginica*x + b_virginica
y_versicolor = a_versicolor*x + b_versicolor
## Line of best fit
iris_scatter_species = sns.regplot(x = x, y = y_setosa, marker= "+", line_kws = {"color": "blue"})
iris_scatter_species = sns.regplot(x = x, y = y_virginica, marker = "+", line_kws = {"color": "green"})
iris_scatter_species = sns.regplot(x = x, y = y_versicolor, marker = "+", line_kws = {"color": "orange"})
iris_scatter_species
The residuals for the latest model are here:
resid3 = results_mod3.resid
fitted3 = results_mod3.fittedvalues
resid3_plot = sns.scatterplot(x = fitted3, y = resid3)
resid3_plot.set(xlabel='Fitted', ylabel='Residuals')
## [Text(0, 0.5, 'Residuals'), Text(0.5, 0, 'Fitted')]
resid3_plot
The qqplot for Model 3
# Use statsmodels qqplot to graph errors
# make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))
# call the qqplot graph from statsmodels 'graphics' module.
# fits against the normal distribution as standard.
sm.graphics.qqplot(resid3, line='45', fit=True, ax=ax);
So how can we decide which model is better? There are a few metrics we can compare between the models, which can be found in the top table of the summary. The three main ones we will discuss in this course are the Adjusted R^2, AIC, and BIC. Before getting to these concepts, let’s discuss the concept of parsimony and Occam's Razor.
Occam’s Razor: law of parsimony
Parsimony (aka Occam's razor) is a general argument for choosing simpler models even though we know the world is complex. Occam’s razor says that when presented with competing hypotheses that make the same predictions, we should select the solution with the fewest assumptions. This is to say that all other things being equal, we should prefer a simpler model to a more complex one, especially when the data don’t tell a clear story.
Model selection approaches often go beyond parsimony to say that a more complex model must not be just better than, but a specified amount better than, a simpler model.
Practical Considerations
There is also a practical element to parsimony; simple theories are easier to test than complex ones. Similarly, simple models often do a better job at predicting. Because a simpler model requires fewer parameters it is also less expensive in terms of time or money to collect the data for it.
We need to draw the line somewhere
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk” ~ John Von Neumann
As we add more parameters to a model, we necessarily get an increasingly accurate fit to the particular data we have observed (the bias of our predictions decreases), but our precision for predicting future observations decreases as well (the variance of our predictions increases).
One way to think about it is that data contain a fixed amount of information; as we estimate more and more parameters we spread the data thinner and thinner. Eventually the gain in accuracy from having more details in the model is outweighed by the loss in precision from estimating the effect of each of those details more poorly.
Adjusted R^2
R-squared is a goodness-of-fit measure for linear regression models. It indicates the proportion of the variance in the response variable explained by the explanatory variables. It is calculated by:
\(\frac{\text{Variance explained by the model}}{\text{Total variance}}\)
The adjusted R^2 additionally penalises the model for each extra parameter, so it only increases if a new explanatory variable improves the fit more than would be expected by chance. This makes it better suited than plain R^2 for comparing models with different numbers of parameters.
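As a quick hedged check of this formula against statsmodels (reusing results_mod1 and df_iris from earlier), we can compute the plain R-squared by hand:
# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
rss = (results_mod1.resid ** 2).sum()
tss = ((df_iris.petal_width - df_iris.petal_width.mean()) ** 2).sum()
print(1 - rss / tss)            # approx 0.927
print(results_mod1.rsquared)    # statsmodels' value, for comparison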
Likelihood
The likelihood is the probability of the observed outcome (i.e. the data) given a particular choice of parameters. For a particular statistical model, maximum likelihood finds the set of parameters that makes the observed data most likely to have occurred. That is, we find the set of parameters that makes the likelihood as large as possible.
Maximum likelihood. Source: Ashan Priyadarshana. https://medium.com/quick-code/maximum-likelihood-estimation-for-regression-65f9c99f815d
The diagram above shows what is happening when you are calculating the maximum likelihood. Here, we have a line of best fit, with the estimated values of the response variable \(\hat y_{1...n}\) (red dots). The actual values of the response variable (our data) are represented by the black dots. The residuals are indicated by \(\epsilon\). Each residual is the distance between the actual value of the response variable and the estimated value of the response variable.
When we are calculating the maximum likelihood, we are looking for the parameters that maximise the likelihood of the data. The horizontal arrows trace up to the normal distribution which represents the fit. The closer to the peak of the distribution the data falls the “more likely” the data is given the parameters.
For mathematical convenience, we often work with the logarithm of the likelihood (the log-likelihood) instead of the likelihood. However, the parameters that give the maximum log-likelihood also give the maximum likelihood.
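To make the log-likelihood concrete, here is a minimal sketch reusing results_mod1 from earlier; the manual calculation assumes normally distributed residuals, as OLS does:
# statsmodels stores the maximised log-likelihood on the fitted results object
print(results_mod1.llf)
# We can reproduce it by summing the log of the normal density of each residual,
# using the maximum-likelihood estimate of the residual variance (RSS / n)
resid_ll = results_mod1.resid
sigma2 = (resid_ll ** 2).sum() / len(resid_ll)
print(stats.norm.logpdf(resid_ll, loc = 0, scale = np.sqrt(sigma2)).sum())
Both print statements should agree (approximately 24.8, matching the Log-Likelihood entry in the model 1 summary table).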
Information criteria
Information criteria are based on the expected distance between a particular model and the “true” model. All information-theoretic methods reduce to finding the model that minimizes some criterion that is the sum of a term based on the likelihood (usually twice the negative log-likelihood) and a penalty term which is different for different information criteria.
Selecting models based on information criteria allows for the comparison of all candidate models at once, provided they are fit to the same data. If there are missing values in certain variables and not others, the model will exclude these data when fitting by default, so you need to be careful that you are not comparing models which have been fit to different datasets.
Akaike Information Criterion, or AIC, is the most widespread information criterion, and is defined as:
\(\text{AIC} = -2L + 2k\)
where \(L\) is the log-likelihood and \(k\) is the number of parameters in the model. As with all information criteria, small values represent better overall fits; adding a parameter with a negligible improvement in fit penalizes the AIC by 2 log-likelihood units.
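As a hedged sanity check, we can reproduce statsmodels' AIC for model 1 from its log-likelihood (counting the intercept as an estimated parameter):
# AIC = -2L + 2k, where k counts the estimated parameters (intercept + slope)
k_params = results_mod1.df_model + 1
print(-2 * results_mod1.llf + 2 * k_params)   # approx -45.59
print(results_mod1.aic)                       # statsmodels' value, for comparison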
Some rough guidance for AIC
For AIC, the rule of thumb people generally follow is that an improvement (reduction) in AIC of greater than 2 justifies selecting the more complicated model.
Model Selection when there are several possibilities
Models with multiple parameters and possible interactions between variables lead to a large number of candidate models to try. Two simple approaches to model selection include:
forward selection: start from a simple model and add variables one at a time, keeping those that improve the model;
backward selection: start from the full model and remove variables one at a time, dropping those that do not worsen the model.
With too large a set of possibilities this can mean that you may arrive at a different best model depending on which approach you take. There are also algorithms that combine forward and backward selection. However, you want to be careful that this kind of model selection does not devolve into data-dredging.
You should:
Exercises
Adjusted R^2
#\n will put the results on the next line!
print("Adjusted R^2 Model 1 = ", results_mod1.rsquared_adj, "\nAdjusted R^2 Model 2 = ", results_mod2.rsquared_adj
, "\nAdjusted R^2 Model 3 = ", results_mod3.rsquared_adj)
## Adjusted R^2 Model 1 = 0.9266173379025906
## Adjusted R^2 Model 2 = 0.944455796412463
## Adjusted R^2 Model 3 = 0.9458862539383163
AIC
print("AIC Model 1 = ", results_mod1.aic, "\nAIC Model 2 = ", results_mod2.aic, "\nAIC Model 3 = ", results_mod3.aic)
## AIC Model 1 = -45.59109158022039
## AIC Model 2 = -85.40821914792849
## AIC Model 3 = -87.39085622323768
Take a look at the main parameters of Model 2 and Model 3 from the model summary tables. Do they seem to vary much between the models?
Starting with the basic model sepal_width ~ sepal_length, explore the variables that best explain the variation in sepal_width within your dataset.
The traditional statistical approach was to assume that all variation in the data was normally distributed, or to transform the data until it was, and then use classical methods based on the normal distribution to draw conclusions.
In generalized linear models, variability isn’t just a nuisance, but actually tells us something about the processes we are interested in. What we treat as “signal” and what we treat as “noise” depends on our question. The same source of variability might provide an interesting insight into the data or be something we wish to account for so we can explore it further.
In the next section, we will introduce you to a few common distributions that capture different forms of variation in the response variable from the exponential family.
We will then give practical examples of how to fit and interpret these models using python.
Why is probability important?
Uncertainty and variability occur in many aspects of life. Probability helps us make sense of and quantify uncertainty. Probability also helps us make informed decisions about what is likely to happen, based on patterns in the data collected.
Terms:
Probability is the chance of an event occurring, defined between 0 and 1.
Random Variable is a variable that takes values depending on the outcomes of random events. It can be discrete if it takes distinct, separate values (e.g. gender) or continuous (e.g. age, weight).
Probability distribution the probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable.
For a discrete random variable, x, the probability distribution is defined by a probability mass function, denoted by \(f(x)\).
For a continuous random variable, because there is an infinite number of values in any interval (e.g. you can always go to smaller and smaller decimals), we instead consider the probability that the variable will lie within a given interval. Here the probability distribution is defined by a probability density function, also denoted by \(f(x)\).
Both probability functions must satisfy two requirements: \(f(x)\) must be non-negative for every value of the random variable, and the probabilities must sum (for a probability mass function) or integrate (for a probability density function) to 1.
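As a small illustrative check of these two requirements using scipy.stats (the distributions chosen here are arbitrary examples):
# Discrete case: binomial pmf values are non-negative and sum to 1
k_check = np.arange(11)
pmf = stats.binom.pmf(k_check, n = 10, p = 0.8)
print((pmf >= 0).all(), pmf.sum())     # True, approx 1.0
# Continuous case: the standard normal pdf integrates to approx 1
from scipy.integrate import quad
area, _ = quad(stats.norm.pdf, -np.inf, np.inf)
print(area)                            # approx 1.0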
Now that we know how to define a probability for discrete and continuous random variables we can take a look at a few common probability distributions. There is a large class of generalized linear models that we can use to model these different distributions.
We’ve already come across the normal distribution, which we use to inspect and compare our residuals. The normal distribution is arguably the most commonly used distribution in statistics. It is used for continuous variables (e.g. height, measurement error). It has several computational properties which make it easy to work with. For instance, it is symmetric, unimodal, and the mean, median, and mode are all equal. That means it can be captured just by specifying the mean and the variance.
You might use it, for example, to represent variation around a population mean, where you know the mean but wish to understand how much values vary around it (e.g. by +/- 10%, +/- 15%).
Below we have two plots of the normal distribution, where the mean is the same: 0, but with higher (5) or lower (1) variance.
mu = 0
variance = 1
sigma = math.sqrt(variance)
norm_1 = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(norm_1, stats.norm.pdf(norm_1, mu, sigma))
mu = 0
variance = 5
sigma = math.sqrt(variance)
norm_2 = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(norm_2, stats.norm.pdf(norm_2, mu, sigma))
plt.ylabel('Density')
plt.xlabel('Value')
plt.show()
The normal distribution is useful for modelling continuous data. In the earlier example, it was appropriate to assume that our response variable (petal width) was normally distributed given that it is a measured continuous variable.
The binomial distribution applies when you have a series of trials where the outcome of the trial can only take one of two values (e.g. black/white, heads/tails, alive/dead, present/absent, buyers/non-buyers).
The binomial distribution can be written as:
\(P(\text{k out of n}) = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}\)
where n is the number of trials, k is the number of successes, and p is the probability of success.
Where the mean is \(\mu = np\) and the variance is \(\sigma^2 = np(1-p)\).
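As a quick hedged check of these mean and variance formulas with scipy (arbitrary example values for n and p):
# Check the binomial mean and variance formulas against scipy
n_check, p_check = 10, 0.8
print(binom.mean(n_check, p_check), n_check * p_check)                  # both 8.0
print(binom.var(n_check, p_check), n_check * p_check * (1 - p_check))   # both approx 1.6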
In the case where there is only one trial (e.g. one coin toss) we can use the Bernoulli distribution, which is simply the special case of the binomial distribution with a single trial. So if we have one coin toss, it can only have two outcomes: heads (0) or tails (1).
from scipy.stats import binom
k = np.arange(2) ## This gives me the number of successes, in this case, heads (0), tails (1).
data_binom = binom.pmf(k, n = 1, p = 0.5) # If the coin is fair, we have a 50/50 chance of getting tails (1), i.e. 50% probability.
ax_binom = sns.barplot(x = k, y = data_binom)
ax_binom.set(xlabel='Tails', ylabel='Density')
## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Tails')]
ax_binom;
If we have two coin tosses, we have 4 possible combinations: heads/heads, tails/heads, heads/tails, tails/tails. These result in 3 outcomes: heads/heads (0+0=0), tails/tails (1+1=2) and heads/tails (0+1=1), because the order doesn’t matter.
from scipy.stats import binom
k = np.arange(3) ## This gives me the number of successes, in this case, heads/heads (0), tails/heads (1), tails/tails (2).
data_binom = binom.pmf(k, n = 2, p = 0.5) # If the coin is fair, we have a 50/50 chance of heads or tails, i.e. 50% probability.
ax_binom = sns.barplot(x = k, y = data_binom)
ax_binom.set(xlabel='Tails', ylabel='Density')
## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Tails')]
ax_binom;
I may wish to know how many students in my course like statistics before I begin. I can start by asking:
What is the probability that any random student likes statistics? 0.7? 0.8?
I’m feeling pretty positive: there’s an 80% chance that a random student likes statistics. But what is the probability that the whole class likes statistics? I could ask:
What is the probability that all 10 students in the course like statistics? 0.8 x 0.8 x 0.8 … x 0.8 = \(0.8^{10} \approx 0.107\) :-(
We can get to this answer by multiplying together the probability of each student liking statistics, in this case \(0.8^{10}\), which is approximately 0.107, or a 10.7% chance that all of the students like statistics.
10.7% doesn’t sound very promising. But I would settle for 6-9 people out of 10 liking statistics. Of course, I won’t know the answer until I attend the course, but I can get an idea of how probable it is that 6/10 or 7/10 of you like statistics by working out the probability mass function:
from scipy.stats import binom
k = np.arange(11) ## This gives me the number of successes, in this case, students from 0 to 10 who like the course.
data_binom = binom.pmf(k, n = 10, p = 0.8) # I can compute the probability of 0/10, 1/10... 10/10 students liking the course. Where n is the number of trials (in this case the number of students).
ax_binom = sns.barplot(x = k, y = data_binom)
ax_binom.set(xlabel='Successes', ylabel='Density')
## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Successes')]
ax_binom;
The binom distribution function from scipy.stats takes n and p as shape arguments, where n is the number of trials and p is the probability of success.
In the code above we have specified that we are asking 10 random students in each trial (n), and that the probability that a student likes statistics is 0.8. From the probability mass function plot it looks quite likely that between 6 and 9 students will like statistics. Hooray!
Some examples of where we might use the binomial are to model survival (alive/dead), presence/absence of a species, or purchasing decisions (buyers/non-buyers).
Where the number of outcomes is greater than 2, e.g. (single, married, divorced), then we can use a multinomial distribution which allows for more than 2 possibilities. We can get our clue for the number of outcomes from the name (bi = 2, multi = multiple).
The Poisson distribution gives the distribution of the number of individuals, arrivals, events, counts, etc., in a given time/space/unit of counting effort.
This is the common distribution to use whenever things are counted and when they are discrete (e.g. you cannot have 1.5 of a person).
Bjørn Christian Tørrissen [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons"
The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. For instance, we could use the Poisson distribution to model the number of cyclists crossing a bridge per day. However, the Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
The Poisson distribution is given by the formula:
\[P(k\ \mathrm{events}) = \frac{\lambda^{k}e^{-\lambda}}{k!} \]
Where \(\lambda\) is the average rate of the events (e.g. cyclists per day), and the function returns the probability of seeing \(k\) events (e.g. \(k\) cyclists).
In the Poisson distribution, the mean and the variance are the same, which is why the poisson function takes only the mean (mu or \(\mu\)) in the code below. \(\mu\) is also referred to as the shape parameter in the distribution.
from scipy.stats import poisson
k = np.arange(100, step = 5)
data_poisson = poisson.pmf(k = k, mu = 40)
ax_poisson = sns.barplot(x = k, y = data_poisson)
ax_poisson.set(xlabel='Count', ylabel='Density')
## [Text(0, 0.5, 'Density'), Text(0.5, 0, 'Count')]
ax_poisson
As mentioned before, the Poisson is the most common distribution used to model counts (events, people, animals, etc.). However, we are constrained by the fact that the mean and variance are the same. This means that larger means allow for greater variation, whereas smaller means allow for less variation. If we need to allow for more variation (i.e. a wider distribution of counts) we can look to the negative binomial.
Fun Fact: One of the most famous historical, practical uses of the Poisson distribution was estimating the annual number of Prussian cavalry soldiers killed due to horse-kicks.
The negative binomial distribution is discrete, like the Poisson, but its variance can be larger than its mean, (i.e. it can be overdispersed). It is often used when the data largely follows a Poisson distribution but there are a large number of zeros in the data (e.g. few sightings and a large number of absences).
In fact, the negative binomial distribution is a generalisation of the Poisson distribution in which the variance is considered separately from the mean, and hence it works well when data are overdispersed. The negative binomial is usually characterised in similar terms to the binomial distribution: \(r\) (the number of failures observed) and \(p\) (the probability of success), with the distribution giving the probability of seeing \(k\) successes before \(r\) failures are observed, given \(p\).
This is how it is done in scipy.stats; however, it can also be parameterised in terms of its mean and variance. As a result, the mean and variance do not have to be equal as with the Poisson distribution, which makes it more flexible.
\[p = \frac{\sigma^{2} - \mu}{\sigma^{2}}\]
\[r = \frac{\mu^{2}}{\sigma^{2} - \mu}\]
Giving the Negative Binomial distribution as:
\[Pr(X = k) = \left( \begin{matrix} k + r - 1\\ k \end{matrix} \right)p^{k} \left( \frac{\mu}{\sigma^{2}} \right)^{r} \]
Where:
\(p\) is the probability of success in a Bernoulli trial with two outcomes (success or failure);
\(r\) is the dispersion or shape parameter;
\(k\) is the given number of successes of interest.
Note that although commonly called \(r\), scipy refers to this parameter as \(n\).
In scipy.stats, nbinom takes n and p as shape parameters where n is the number of successes, whereas p is the probability of a single success.
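Before sampling, here is a hedged sketch of converting a mean/variance description into scipy's (n, p) parameterisation (example values; note that scipy's p corresponds to \(\mu/\sigma^2\), i.e. one minus the \(p\) defined above):
# Convert a mean/variance parameterisation into scipy's (n, p)
from scipy.stats import nbinom
mu_nb, sigma2_nb = 5, 10                       # an overdispersed count: variance > mean
p_scipy = mu_nb / sigma2_nb                    # scipy's success probability
n_scipy = mu_nb**2 / (sigma2_nb - mu_nb)       # scipy's n (the r defined above)
print(nbinom.mean(n_scipy, p_scipy), nbinom.var(n_scipy, p_scipy))  # approx 5, 10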
In the code below we are showing a different approach: in this case we are generating random values from the negative binomial distribution. On the x axis we are plotting the number of failures (e.g. days on which no animal or no person was detected) observed before we accumulate n successes.
In this case size is the number of samples we draw, n is the number of successes we are waiting for, and p is the probability of a success.
from scipy.stats import nbinom
data_nbinom = nbinom.rvs(n = 5, p = 0.5, size=1000)
ax_nbinom = sns.distplot(data_nbinom, kde = False)
ax_nbinom.set(xlabel='Negative Binomial', ylabel='Frequency')
## [Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Negative Binomial')]
ax_nbinom
From this we can see that 4 failures before the 5th success is the most frequent value, followed by 5. You’ll notice that we also get frequencies for values from 0 up to around 30. We can transform these frequencies into estimated probabilities by dividing each frequency by the number of samples (size) we drew.
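For example, a hedged one-liner to turn the sampled frequencies into estimated probabilities (reusing data_nbinom from above):
# Estimate probabilities from the sampled values
values, counts = np.unique(data_nbinom, return_counts = True)
print(dict(zip(values, np.round(counts / len(data_nbinom), 3))))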
Relating the number of failures before a given number of successes to counts may seem a little abstract, but the important information to take away is that the negative binomial can be an effective way to model count data that are overdispersed (i.e. where the variance is larger than the mean).
Exercise
Experiment with the poisson distribution. What happens when you experiment with the shape parameter, mu?
Experiment with the normal distribution. What happens when you change the variance? What happens when you change the mean?
In groups of 2-3, discuss some examples from your own work where you have a response variable that comes from a Binomial, Poisson, Negative binomial or Normal distribution?
A generalized linear model consists of 3 parts:
The exponential family probability distribution, is the probability distribution which our response variable follows, for instance, Normal, Binomial, Poisson, and Negative Binomial.
The linear predictor is a linear combination of a set of coefficients and explanatory variables used to predict our response variable. This is essentially the same form used to specify our linear model, which means we can use the same formula set-up for our GLM.
The link function connects the exponential family distribution to the linear predictor. There will often be defaults for the link when you call the model family, but you can also specify it yourself. For instance, the Poisson distribution is a discrete distribution whose values are integers between 0 and infinity, whereas the linear predictor is continuous and can range from -infinity to infinity. It is therefore common to use a log link, which models the log of the expected response with the linear predictor and so constrains predictions to lie between 0 and infinity, making it suitable for the Poisson distribution. This means that we often need to do a back-transformation when we interpret the parameter coefficients on the original scale. For example, when using the log link we would exponentiate the coefficients.
Specifying the exponential family
To specify the exponential family we add an additional argument in our model specification, family = sm.families.Binomial(). There are several families we can specify; here are the main ones we have covered so far:
sm.families.Binomial()
sm.families.Gaussian() (i.e. the normal distribution)
sm.families.Poisson()
sm.families.NegativeBinomial()
You can change the link from the default, according to the model you wish to fit, by specifying it as follows:
sm.families.Binomial(link = sm.families.links.probit())
We then specify the formula the same way as before (y ~ x).
# formula = 'y ~ x'
# mod1 = smf.glm(formula = formula, data = df, family = sm.families.Binomial()).fit()
# mod1.summary()
#
# To use a different family, swap in, e.g., sm.families.Poisson():
# mod1 = smf.glm(formula = formula, data = df, family = sm.families.Poisson()).fit()
# mod1.summary()
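Putting this together, here is a hedged, self-contained sketch of fitting a Poisson GLM with the log link described above, using simulated count data (df_sim, the coefficients, and the seed are illustrative, not from the course materials):
# Simulate Poisson counts where log(mu) = 0.5 + 0.8 * x, then recover the
# coefficients with a Poisson GLM and back-transform them with exp()
rng = np.random.default_rng(42)
x_sim = rng.uniform(0, 2, 200)
y_sim = rng.poisson(np.exp(0.5 + 0.8 * x_sim))
df_sim = pd.DataFrame({'x': x_sim, 'y': y_sim})
pois_mod = smf.glm(formula = 'y ~ x', data = df_sim, family = sm.families.Poisson()).fit()
print(pois_mod.params)           # coefficients on the log scale
print(np.exp(pois_mod.params))   # back-transformed: multiplicative effects per unit of x
Each exponentiated coefficient is the multiplicative change in the expected count for a one-unit increase in the explanatory variable.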
In this course we have covered linear regression and a few examples of generalized linear models, but there is a whole range of models we can fit. Some of the topics we haven’t covered are capturing structure in our data (e.g. repeated measures, spatial and temporal correlation, random effects, generalised additive models, etc.).
Statisical Model Ecosystem. Source: Ecological Models and Data in R by Ben Bolker 2007, pg 397.
For this practical we will be using the python notebook: Count_GLM.ipynb.
What do Alan Turing and Nate Silver have in common?
Alan Turing was a famous British mathematician who cracked messages from the enigma machine during WWII, substantially shortening the war.
Nate Silver is founder of the data-driven politics, economics, and sport website fivethirtyeight.com. Nate Silver successfully predicted the outcome of the two Obama elections using poll data, name recognition, and other information sources.
So what do Nate Silver and Alan Turing have in common?
They both worked on complex problems, with large inherent uncertainty that needed to be quantified, and with analyses that required the efficient integration of many sources of information. To solve these problems they used Bayesian Analyses.
Bayesian inference is a method for figuring out unknowns that requires 3 things:
Data
A generative model
Priors (what we believed about the unknowns before seeing the data)
Generative Model
When we know the parameters we can simulate data using a generative model (e.g. linear model, generalized linear model, polynomial, etc.).
However, it is often the case that we have the data and we want to estimate the parameters to explore our hypotheses.
Using Bayesian inference, we can estimate parameters from data.
In order to understand Bayesian Inference, we first need to learn about:
Probability is the chance of an event occurring, expressed as a number between 0 and 1.
A random variable is a variable whose value depends on the outcome of a random event.
A probability distribution describes how probability is distributed over the values of the random variable.
For a discrete random variable, x, the probability distribution is defined by a probability mass function, denoted by \(f(x)\).
For a continuous random variable, because there is an infinite number of values in any interval (e.g. you can always go to smaller and smaller decimals), we instead consider the probability that a continuous random variable will lie within a given interval. Here the probability distribution is defined by a probability density function, also denoted by \(f(x)\).
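To make the distinction concrete, here is a quick sketch using scipy.stats (the particular distributions and values are just illustrative choices):

import numpy as np
from scipy.stats import binom, norm

print(binom.pmf(3, n=10, p=0.5))    # pmf: P(X = 3) for X ~ Binomial(10, 0.5)
print(norm.pdf(0, loc=0, scale=1))  # pdf: density at 0 (NOT a probability)
# for a continuous variable we ask about intervals instead:
print(norm.cdf(1) - norm.cdf(-1))   # P(-1 < X < 1), roughly 0.68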
To define a probability, we first need to identify the sample space, that is, the set of all possible outcomes that could occur. The probability of an event A is then the long-run relative frequency with which that event occurs.
Here is a short guide to some probability notation that we will see.
The probability must be non-negative for each value of the random variable and the sum of probabilities must equal 1.
If two events are mutually exclusive (e.g. “patient has HIV” and “patient does not have HIV”) then the probability that either occurs, the probability of \(A\) or \(B\), written \(P(A \cup B)\), is the sum of their individual probabilities. For example, the probability of being HIV positive (HIV+) or HIV negative (HIV-) is \(P(\text{HIV+} \cup \text{HIV-}) = P(\text{HIV+}) + P(\text{HIV-})\).
If two events A and B are not mutually exclusive, i.e. the joint probability that they occur together, \(P(A \cap B)\), is greater than 0, then we have to correct the rule above to account for double-counting: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
The conditional probability of \(A\) given \(B\), \(P(A|B)\), is the probability that \(A\) happens if we know or assume that \(B\) happens: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\).
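As a quick numeric check of this formula (the fair-die example is mine, not from the course), take A = “roll is even” and B = “roll is greater than 3”:

from fractions import Fraction

outcomes = range(1, 7)  # sample space of a fair die
p = Fraction(1, 6)      # each outcome equally likely
p_b = sum(p for o in outcomes if o > 3)                        # P(B) = 1/2
p_a_and_b = sum(p for o in outcomes if o > 3 and o % 2 == 0)   # P(A and B) = 1/3
print(p_a_and_b / p_b)  # P(A|B) = (1/3)/(1/2) = 2/3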
Bayes’ Rule is essentially a recipe for turning around a conditional probability so we can learn about the thing we are interested in, our hypothesis!
\[ P(\text{Hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Hypothesis}) \, P(\text{Hypothesis})}{P(\text{Data})} \]
In words this states: the probability of the hypothesis given the data is the probability of the data given the hypothesis, times the prior probability of the hypothesis, divided by the probability of the data.
There’s been an outbreak of a zombie virus and it’s sweeping through the UK.
Luckily the probability of having the zombie virus (being Infected, \(I\)) is low, with only 1 in a million affected: \(P(\text{Infected}) = 10^{-6}\).
A test for the virus exists that has no false negatives, i.e. everyone who is infected will be detected: \(P(\text{Test positive}|\text{Infected}) = 1\).
However, the test does have some false positives, with 1 in 100 healthy individuals wrongly classified as having the zombie virus: \(P(\text{Test positive}|\text{Uninfected}) = 10^{-2}\).
You’ve not been feeling well the last few weeks. You decide to take the test and it comes back positive for the zombie virus.
You know that 1 in 100 healthy individuals are wrongly classified as having the zombie virus, but still you need to convince your friends not to banish you.
Luckily Bayes’ Theorem is here to help you.
Bayes’ Theorem can be written as:
\[ P(\text{Hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Hypothesis}) \, P(\text{Hypothesis})}{P(\text{Data})} \]
Replacing Hypothesis with “Infected” and Data with “Test positive” you get:
\[ P(\text{Infected} \mid \text{Test positive}) = \frac{P(\text{Test positive} \mid \text{Infected}) \, P(\text{Infected})}{P(\text{Test positive})} \]
However, the probability of Testing Positive is unknown. Since you are either Infected or Uninfected (\(U\)), these events are mutually exclusive, so
\(P(\text{Test positive}) = P(\text{Test positive}\cap\text{Infected}) + P(\text{Test positive}\cap\text{Uninfected})\)
This is equivalent to
\(P(\text{Test positive}) = P(\text{Test positive}|\text{Infected})P(\text{Infected}) + P(\text{Test positive}|\text{Uninfected})P(\text{Uninfected})\)
because \(P(\text{Test positive}\cap\text{Infected}) = P(\text{Test positive}|\text{Infected})P(\text{Infected})\), and similarly for the Uninfected term.
Because the population is made up of Infected and Uninfected people, we can get the proportion of the population uninfected using \(P(\text{Uninfected})= 1 - P(\text{Infected})\).
We can then take the elements we know and use these to determine \(P(\text{Test Positive})\).
\(P(\text{Test positive}) = {\color{blue}{P(\text{Test positive}|\text{Infected})}}{\color{red}{P(\text{Infected})}} + {\color{orange}{P(\text{Test positive}|\text{Uninfected})}}(1 - {\color{red}{P(\text{Infected})}})\)
This gives us
\(= {\color{blue}{1}}\times{\color{red}{10^{-6}}} + {\color{orange}{10^{-2}}}\times(1 - {\color{red}{10^{-6}}})\)
\(= 10^{-6} + 10^{-2} - 10^{-8}\) \({\color{green}{\approx 10^{-2}}}\)
Since \(10^{-6}\) is ten thousand times smaller than \(10^{-2}\), and \(10^{-8}\) is a million times smaller, we can neglect them for now and say that the probability of testing positive is around \(10^{-2}\).
Now we can plug into our formula. The probability of being infected given a positive test is:
\[ P(\text{Infected} \mid \text{Test positive}) = \frac{1 \times 10^{-6}}{10^{-2}} = 10^{-4} \]
Phew! So even though false positives are unlikely, the chance that you are infected is even lower: only about 1 in 10,000.
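To double-check the arithmetic, here is the same calculation in code, using the numbers from the example:

p_infected = 1e-6                  # prior: 1 in a million infected
p_pos_given_infected = 1.0         # no false negatives
p_pos_given_uninfected = 1e-2      # 1 in 100 false positives

p_pos = (p_pos_given_infected * p_infected
         + p_pos_given_uninfected * (1 - p_infected))
posterior = p_pos_given_infected * p_infected / p_pos
print(posterior)                   # about 1e-4, i.e. roughly 1 in 10,000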
But Bayes’ Rule is general: our hypothesis and data can be any events! We often use it in our models to estimate our parameters.
Seeing as we are really interested in the parameters in our model, we can write Bayes’ Theorem as follows:
\[ P(Parameters \mid Data) = \frac{P(Data \mid Parameters) \, P(Parameters)}{P(Data)} \]
Bayes’ Theorem in layman’s terms: Prior Knowledge + Data = Current Knowledge.
This translates in Bayesian terms to:
Prior Knowledge: Prior
Data: Data comes in on its own and also as part of the Likelihood
Current Knowledge: Posterior
Terms
Prior distribution: what you know about the parameters before seeing the data - denoted \(P(Parameters)\).
Likelihood: based on our modelling assumptions, how likely the data are given the choice of parameters - denoted \(P(Data|Parameters)\).
Posterior: the probability of the chosen parameters given the data - denoted \(P(Parameters|Data)\).
One of the benefits of the Bayesian approach is that we can keep adding data and updating the posterior (our current knowledge of the process) as data becomes available.
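As a sketch of this updating process (the conjugate Beta-Binomial pair and the made-up coin-flip batches below are my illustrative choices, not course material), each posterior becomes the prior for the next batch of data:

from scipy.stats import beta

a, b = 1, 1                              # flat Beta(1, 1) prior
batches = [(7, 3), (12, 8), (55, 45)]    # (heads, tails) in each new batch of data

for heads, tails in batches:
    a, b = a + heads, b + tails          # posterior becomes the new prior
    print(f"posterior mean: {beta.mean(a, b):.3f}  (Beta({a}, {b}))")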
Priors come from all information external to the current study (i.e. everything else).
If we don’t have strong evidence or prior beliefs (and sometimes even if we do), we use what are called “flat” or “uninformative” priors, which are designed to let the data “speak for themselves” when estimating the parameter we are interested in.
In Bayesian statistics we use priors and posteriors to represent our uncertainty about the world.
For example, let’s imagine my friend Elias and I have gone to a pub quiz and we’re asked: in what year did the Titanic sink?
Elias has recently moved to Halifax, Nova Scotia, where rescue efforts to save the survivors were coordinated. While touring the maritime museum, he remembers reading that the Titanic sank before WWI began. In fact, he is pretty confident it sank between 1910 and 1920, with only a small possibility it occurred before or after those dates.
Alas, aside from seeing Titanic three times in the cinema when it came out (“I’ll never let go, Jack!” <3), I can’t remember when the boat actually sank. Based on the clothes people were wearing in the film and the fact that no one was travelling by plane, I reckon it could have sunk sometime between 1890 and 1930.
Because we’re both fans of Bayesian Statistics we decide to scribble down our priors on a napkin.
Both Elias and I have different priors that we can confront with data, e.g. studies, information on the internet, talking to one another. In this way we can arrive at a more precise guess (posterior) to give at our pub quiz, using the information we learn to make our distributions tighter and tighter around the true answer.
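As an illustrative sketch (the means and spreads below are my guesses at what our napkin curves might look like, not anything from the course), we could plot the two priors like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

years = np.linspace(1880, 1940, 500)
plt.plot(years, norm.pdf(years, loc=1915, scale=3), label="Elias (confident)")
plt.plot(years, norm.pdf(years, loc=1910, scale=10), label="Me (vague)")
plt.axvline(1912, linestyle="--", color="grey", label="Truth: 1912")
plt.xlabel("Year the Titanic sank")
plt.ylabel("Prior density")
plt.legend()
plt.show()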
Exercise
Without looking it up, draw a prior distribution for the percentage of the world that is water.
Remember that the probability distribution over all values must sum to 1 and that the highest point of the distribution is where you think the answer most probably lies.
Compare with your neighbours. How do your guesses vary?
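Once you have drawn your prior by hand, one possible way to express it in code (the Beta distribution and its parameters here are just an illustrative choice of mine) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

p = np.linspace(0, 1, 500)
plt.plot(p, beta.pdf(p, a=6, b=3))  # an example prior peaking around 2/3
plt.xlabel("Proportion of the world that is water")
plt.ylabel("Prior density")
plt.show()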
Who would win in a fight between Thomas Bayes and Ronald Fisher?
A long-standing debate in statistics concerns the validity of two popular approaches: Bayesian vs. frequentist statistics. When you research it, you often find discussions of philosophy rather than statistics.
If you’ve taken an introductory statistics course at university, you’ve probably learned frequentist statistics. Often thought of as “classical statistics”, frequentist statistics covers p-values and null and alternative hypotheses.
Frequentist statistics treats parameters as fixed but unknown quantities and defines probability as the long-run frequency of events over repeated sampling.
Advantages of Bayesian Approach
You can incorporate prior observations and prior knowledge. This is especially useful if you’ve carried out previous experiments or if there are existing studies.
It is useful in complex models with missing data and several layers of variability.
It can be used to make management decisions based on our data, including evaluating unlikely scenarios.
Disadvantages of Bayesian Approach
In Bayesian statistics, we have to specify our prior beliefs about the probability of different hypotheses.
These prior beliefs actually affect our answers.
This means we can end up confirming our own beliefs and getting back what we put in.
To deal with subjectivity we can:
Set a “flat prior” or uninformative prior to let the data speak for themselves.
Conduct sensitivity analyses, i.e. try different priors and see whether they influence the results.
Use priors that try to protect against subjectivity, e.g. penalized-complexity priors.
Take a look at the python notebook, Bayesian.ipynb, to complete the practical.
Bayesian vs. Frequentist Zombies: https://www.theregister.co.uk/2017/06/22/bayesian_vs_frequentist_ai/
Frequentist approaches: https://www.theregister.co.uk/2017/05/24/fear_of_statistics/
Book: Statistical Rethinking by Richard McElreath.